Optimize Document Identifier Assignment for Inverted Index Compression

Authors

Chong Chen

Jing He

Dongdong Shan

Hongfei Yan

Abstract

Document identifier assignment is a technique for compressing inverted file indexes by reducing the d-gap values of posting lists. Existing studies approach it with either TSP-based or clustering methods. However, the problem lacks a proper formulation, and the existing approaches have no theoretical guarantee of being good approximations. In this paper, we first formulate the document identifier assignment problem as an optimization problem and then propose a new method to solve it approximately. Our method first clusters the documents by URL information and then rearranges the documents and clusters using a benefit function, which is derived by directly minimizing posting space. The TSP method can be considered a simple special case of our method. The experiments show that it achieves a good trade-off between efficiency and effectiveness.

Keywords: document identifier, cluster, inverted index compression, optimization
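To make the notion of d-gaps concrete, the following is a minimal sketch of how the order in which docIDs are assigned changes the space needed for posting lists. It is not the paper's benefit function or algorithm: the cost model (binary length of each gap), the function names, and the toy documents are all assumptions chosen for illustration. Sorting by URL stands in, very roughly, for the idea of grouping documents by URL information.

```python
def d_gaps(postings):
    """Turn a sorted list of docIDs into gaps; the first gap is measured from 0."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def gap_bits(gap):
    """Assumed cost model: number of bits in the binary form of the gap."""
    return gap.bit_length()


def index_cost(docs, order):
    """Approximate total bits for all posting lists under a docID order.

    docs:  dict mapping a document name (here a URL) to its set of terms
    order: list of document names; position + 1 becomes the docID
    """
    doc_id = {name: i + 1 for i, name in enumerate(order)}
    postings = {}
    for name, terms in docs.items():
        for term in terms:
            postings.setdefault(term, []).append(doc_id[name])
    return sum(
        gap_bits(g)
        for plist in postings.values()
        for g in d_gaps(sorted(plist))
    )


# Hypothetical toy collection: pages from the same site share vocabulary.
docs = {
    "a.com/1": {"apple", "pie"},
    "b.org/1": {"kernel", "linux"},
    "a.com/2": {"apple", "tart"},
    "b.org/2": {"kernel", "patch"},
}

interleaved = ["a.com/1", "b.org/1", "a.com/2", "b.org/2"]
by_url = sorted(docs)  # places pages of the same site next to each other

cost_interleaved = index_cost(docs, interleaved)
cost_by_url = index_cost(docs, by_url)
```

On this toy collection the URL-sorted order yields a lower total cost than the interleaved order, because documents that share terms receive adjacent identifiers and their gaps shrink.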

Related resources

Sorting Out the Document Identifier Assignment Problem

The compression of Inverted File indexes in Web Search Engines has received a lot of attention in recent years. Compressing the index not only reduces space occupancy but also improves the overall retrieval performance since it allows a better exploitation of the memory hierarchy. In this paper we are going to empirically show that in the case of collections of Web Documents we can enhance ...

Assigning Document Identifiers to Enhance Compressibility of Fulltext Indices

Index compression has been a major issue in the field of Information Retrieval Systems. In particular, due to the impressive figures involved with Web Search Engines (WSEs) the compression of the index is not an option anymore but it has become a must. The most important index compression methods are designed to work for Inverted File (IF) indexes. These methods are based on the assumption that...

I Inverted Index Compression

The data structure at the core of today's large-scale search engines, social networks, and storage architectures is the inverted index. Given a collection of documents, consider for each distinct term t appearing in the collection the integer sequence ℓ_t, listing in sorted order all the identifiers of the documents (docIDs in the following) in which the term appears. The sequence ℓ_t is called ...

Unaligned Binary Codes for Index Compression in Schema-Independent Text Retrieval Systems

We examine index compression techniques for schema-independent inverted files used in text retrieval systems. Schema-independent inverted files contain full positional information for all index terms and allow the structural unit of retrieval to be specified dynamically at query time, rather than statically during index construction. Schema-independent indices have different characteristics than ...

Document Identifier Reassignment Through Dimensionality Reduction

Most modern retrieval systems use compressed Inverted Files (IF) for indexing. Recent works demonstrated that it is possible to reduce IF sizes by reassigning the document identifiers of the original collection, as it lowers the average distance between documents related to a single term. Variable-bit encoding schemes can exploit the average gap reduction and decrease the total amount of bits p...
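As a rough illustration of why smaller gaps help a variable-length code, the sketch below uses variable-byte coding. This scheme is chosen here only because it is a common example; the abstract does not say which encoding the paper evaluates, and the function names and sample posting lists are hypothetical.

```python
def vbyte_encode(n):
    """Encode a non-negative integer, 7 payload bits per byte.

    Bytes are emitted low-order group first; the final byte has its
    high bit set as a stop marker.
    """
    chunks = []
    while n >= 128:
        chunks.append(n & 0x7F)  # low 7 bits, more bytes follow
        n >>= 7
    chunks.append(n | 0x80)      # last byte carries the stop bit
    return bytes(chunks)


def vbyte_decode(data):
    """Decode a byte string produced by repeated vbyte_encode calls."""
    values, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            values.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return values


def encode_postings(postings):
    """Gap-encode a sorted docID list, then variable-byte encode each gap."""
    out, prev = bytearray(), 0
    for doc_id in postings:
        out += vbyte_encode(doc_id - prev)
        prev = doc_id
    return bytes(out)


# Hypothetical posting lists for one term before and after reassignment.
scattered = [5, 1000, 20000, 300000]
clustered = [5, 6, 7, 8]

size_scattered = len(encode_postings(scattered))
size_clustered = len(encode_postings(clustered))
```

Both lists hold four docIDs, yet the clustered list encodes into fewer bytes because every gap after the first fits in a single byte.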

Publication date: 2010
